Reading an arbitrarily long line in C

Unfortunately ISO C provides no easy, standard way to read in all of an arbitrarily long line of text. You might think of using (f)scanf or gets, but both of these have their drawbacks.

For a start, gets doesn't check the size of the buffer you give it (it has no way of knowing it), so any sufficiently long line overflows the buffer. For this reason never ever use gets! It was removed from the language altogether in C11.

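Similarly, naively using the %s format with (f)scanf can easily lead to buffer overflow. You can get around this using something like %20s, which tells it to store at most 20 characters (plus the terminating '\0', so the buffer needs 21 bytes). Unfortunately the limit is fixed in the format string and anything beyond it is left unread in the stream, so it's not very flexible. Note also that %s stops at the first whitespace character, so it reads a single word rather than a whole line.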
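As a small illustration of the field width, here is a minimal sketch that uses sscanf on an in-memory string so the behaviour is easy to try out. The helper name first_word is made up for this example; it is not a library function.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy the first whitespace-delimited word of src into
   out, which must have room for at least 21 bytes (20 characters plus the
   terminating '\0'). Returns 1 on success, 0 if src contains no word. */
int first_word(const char *src, char *out)
{
    assert(src != NULL && out != NULL);
    /* %20s skips leading whitespace, then stores at most 20 characters
       and always appends a '\0' */
    return sscanf(src, "%20s", out) == 1;
}
```

Given a 26-letter word, out receives only the first 20 letters; the remaining six are simply not consumed.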

There are a few solutions to this. Here are some of them, in decreasing order of convenience and (approximately) increasing order of portability:

Using the GNU %as extension to fscanf

Normally the %s format causes fscanf to read an arbitrarily long whitespace-delimited string into whatever buffer you give it. As has been mentioned, this is insecure and should not be used. However, if you are using GNU libc, you can make use of the %as format. This is exactly the same as %s except that

  • the function allocates as much memory as needed for the string; and
  • the corresponding pointer argument will have type char ** rather than char * (since the function needs to change what the argument points to).

So you might use something like:

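#define _GNU_SOURCE /* keep the GNU meaning of %a under C99 and later */
#include <stdio.h>
#include <stdlib.h>

/* Not called getline: glibc's <stdio.h> already declares a function of that
   name. Note that %as, like %s, stops at whitespace, so this returns one
   word rather than a whole line. */
char * get_word(FILE * f)
{
    char * buf = NULL;
    int result = fscanf(f,"%as",&buf);
    if (result != 1) { /* EOF or matching failure */
        free(buf); /* free(NULL) is harmless */
        return NULL;
    }
    return buf;
}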

This is a very handy mechanism. Its major disadvantage is that it is incompatible with the ANSI and ISO C standards: in C99, %a is a floating-point conversion. POSIX.1-2008 later standardised the same facility under a different spelling, %ms, but that is still not part of ISO C. Unless you are sure you're using GNU libc and you don't mind breaking portability (usually not a good idea), you shouldn't use %as.
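For comparison, here is a minimal sketch of the same idea using the m modifier from POSIX.1-2008 (written %ms), which does not clash with C99's %a. It assumes a C library that implements this modifier (recent GNU libc does); read_word is a hypothetical name chosen for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return the next whitespace-delimited word from f in
   freshly allocated memory (the caller frees it), or NULL if none is left.
   Assumes a C library that implements the POSIX.1-2008 'm' modifier. */
char *read_word(FILE *f)
{
    char *word = NULL;

    assert(f != NULL);
    /* With %ms the argument is a char **; fscanf allocates the string */
    if (fscanf(f, "%ms", &word) != 1)
        return NULL;    /* EOF or matching failure: nothing was allocated */
    return word;
}
```

Like %as, this reads one word at a time, not a whole line.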

Using the GNU readline library

readline is an extremely handy library for accepting user input. Not only is it easy to use from the programmer's perspective, it provides the user with command line editing such as you would expect in e.g. the bash shell. In fact, bash uses readline for user input, as do many other programs, so you may well already be familiar with it.

There is a lot of scope for customising readline, but for basic usage all you need is something like:

#include <stdio.h>
#include <readline/readline.h>

char * foo()
{
return readline("Prompt: ");
}

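The string readline returns is allocated with malloc, so free it when you have finished with it; a NULL return means end of input. Remember to link your executable to libreadline and, where required, a terminal library such as libtermcap (with gcc, this means using the linker switches -lreadline -ltermcap; on many current systems -lreadline alone is enough, or the terminal library is ncurses instead). Make sure the readline header files are in your include path. Most modern Linux distributions have readline installed by default, though you may need to install the headers yourself (they are usually in a package called something like readline-devel or libreadline-dev).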

The only drawbacks to readline are:

  • it's rather big - it has lots of features you may well not need
  • the person running your code may not have readline installed. You can get around this by statically linking the library, but remember it is rather big (around 167 KB for version 4.3)
  • it's only suitable for interactive input from a terminal

Using the GNU getline function

getline is a function that was added to the GNU version of libc to address the very problem this article discusses. It has since been standardised in POSIX.1-2008, although it is still not part of ISO C. For terminal input readline is easier to use, so if you are only accepting terminal input, consider using that instead.

Here's the kind of thing you'll need to do to use getline:

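#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> /* ssize_t */

char * foo(FILE * f)
{
    size_t n = 0; /* getline takes a size_t *, not an int * */
    ssize_t result;
    char * buf = NULL; /* must be NULL (or a malloc'd block) on entry */

    result = getline(&buf, &n, f);
    if (result < 0) {
        free(buf); /* getline may allocate even when it fails */
        return NULL;
    }
    return buf; /* includes the trailing '\n', if there was one */
}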

getline is an extension to the ISO C stdio library, so again it is only available if you can rely on the presence of GNU libc or another C library that implements POSIX.1-2008. Statically linking libc to guarantee its presence is not a good idea: it means your entire program is statically linked, whereas you usually want to link standard libraries dynamically. Moreover, a statically linked copy won't benefit from later updates to libc.

Another disadvantage is that it trades simplicity for flexibility, so it is not quite as easy to use as (f)scanf (above) or a custom function.
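The flexibility comes from the fact that getline can reuse and enlarge one buffer across many calls. As a rough sketch of that usage, assuming a POSIX.1-2008 C library, the hypothetical helper count_lines below (not part of any library) walks a whole stream with a single buffer:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: count the lines in f and report the length of the
   longest one (excluding its newline) through *longest. */
size_t count_lines(FILE *f, size_t *longest)
{
    char *buf = NULL;   /* getline allocates when this is NULL */
    size_t cap = 0;     /* capacity of buf, maintained by getline */
    ssize_t len;
    size_t lines = 0;

    assert(f != NULL && longest != NULL);
    *longest = 0;
    while ((len = getline(&buf, &cap, f)) != -1) {
        if (len > 0 && buf[len - 1] == '\n')
            len--;      /* do not count the trailing newline */
        if ((size_t)len > *longest)
            *longest = (size_t)len;
        lines++;
    }
    free(buf);          /* the same buffer was reused for every line */
    return lines;
}
```

The return value of getline is the number of characters read, so there is no need to call strlen on the result.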

Doing it yourself with fgets and realloc

Once all these options are exhausted (GNU libc or readline is unavailable, you're trying to read from a file rather than a terminal, ...), the only option left is to roll your own getline function. Fortunately this is quite easy, as you can combine two completely standard library functions: fgets and realloc.

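realloc is like malloc, except that it resizes an existing block of memory instead of creating a new one. The block may move in the process, so always use the pointer realloc returns, but its contents are preserved (up to the smaller of the old and new sizes). If realloc fails it returns NULL and leaves the original block untouched.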
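Because realloc returns NULL on failure while the old block stays allocated, assigning its result straight back to the only pointer you hold would leak the block. A common way round this, sketched here with a hypothetical helper called grow, is to go through a temporary pointer:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: grow (or shrink) *block to new_size bytes.
   Returns 1 on success; on failure returns 0 and leaves *block valid
   and unchanged, so the caller can still use or free it. */
int grow(char **block, size_t new_size)
{
    char *tmp;

    assert(block != NULL && new_size > 0);
    tmp = realloc(*block, new_size);   /* realloc(NULL, n) acts like malloc(n) */
    if (tmp == NULL)
        return 0;                      /* old block is still allocated */
    *block = tmp;                      /* the block may have moved */
    return 1;
}
```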

fgets reads a line of text from a stream (i.e. a FILE *), stopping early if it runs out of space or hits EOF first. You tell it how big the buffer is, and it stores at most one character fewer than that, followed by a terminating '\0'. This means that, provided you do not tell it there is more space in the buffer than there really is, you will not get buffer overflows this way. The price is that a long line arrives in pieces. You can get around this restriction by defining a function which reads from the stream repeatedly, getting more memory as needed, until finally the whole line has been read.
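The usual way to tell whether fgets delivered a whole line or only a piece of one is to look at the last character it stored. A minimal sketch, using a hypothetical helper named read_chunk:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one piece of a line into buf (capacity cap).
   Returns 1 if the piece ends in '\n' (the line is complete), 0 if it is
   partial (buffer full, or EOF reached before a newline), and -1 if
   nothing could be read at all. */
int read_chunk(FILE *f, char *buf, int cap)
{
    size_t len;

    assert(f != NULL && buf != NULL && cap > 1);
    if (fgets(buf, cap, f) == NULL)
        return -1;                     /* EOF or read error */
    len = strlen(buf);
    return len > 0 && buf[len - 1] == '\n';
}
```

With a 5-byte buffer, the line "abcdefgh" followed by a newline arrives as "abcd", then "efgh", then the newline on its own.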

Here is one possible implementation:

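#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Called read_line rather than getline, because some C libraries already
   declare getline in <stdio.h>. Returns a malloc'd line (the caller frees
   it), or NULL on EOF, read error or allocation failure. */
char * read_line(FILE * f)
{
    size_t size = 0;
    size_t len = 0;
    char * buf = NULL;
    char * tmp;

    do {
        size += BUFSIZ; /* BUFSIZ is the stdio buffer size, a convenient increment */
        tmp = realloc(buf,size); /* realloc(NULL,n) is the same as malloc(n) */
        if (NULL == tmp) { free(buf); return NULL; }
        buf = tmp;
        /* Actually do the read. fgets puts a terminal '\0' on the end of
           the string, so we start the next read at that '\0' (buf+len) to
           overwrite it, and pass only the space that is left */
        if (NULL == fgets(buf+len,(int)(size-len),f)) {
            if (0 == len || ferror(f)) { free(buf); return NULL; }
            break; /* EOF after a final line with no '\n' */
        }
        len += strlen(buf+len);
    } while (buf[len-1]!='\n');
    return buf;
}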

This is not quite as efficient as it could be; optimisation is left as an exercise for the reader.
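One possible direction for that exercise, offered only as a sketch: double the buffer each time it fills rather than growing it by a fixed step, so that very long lines need far fewer calls to realloc. The name read_line_doubling and the starting size of 128 bytes are arbitrary choices for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: read one line of any length from f, doubling the
   buffer whenever it fills up. Returns malloc'd memory (the caller frees
   it), or NULL on end-of-file, read error or allocation failure. */
char *read_line_doubling(FILE *f)
{
    size_t size = 128;              /* assumed starting size */
    size_t len = 0;
    char *buf = malloc(size);
    char *tmp;

    assert(f != NULL);
    if (buf == NULL)
        return NULL;
    for (;;) {
        /* the cast assumes the free space fits in an int */
        if (fgets(buf + len, (int)(size - len), f) == NULL)
            break;                  /* EOF or read error */
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;             /* complete line */
        if (len + 1 == size) {      /* buffer full: double it */
            tmp = realloc(buf, size * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            size *= 2;
        }
    }
    if (len == 0 || ferror(f)) {    /* nothing read, or the read failed */
        free(buf);
        return NULL;
    }
    return buf;                     /* last line, with no trailing '\n' */
}
```

A further refinement would be to shrink the buffer with realloc once the final length is known.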

Page last modified by pwb on Tue, 24 Sep 2019 11:52:11 +0000